GC content dependency of open reading frame prediction via stop codon frequencies.

Authors

  • Martin Pohl
  • Günter Theissen
  • Stefan Schuster
Abstract

A frequently used approach for detecting potential coding regions is to search for stop codons. In the standard genetic code, 3 out of 64 trinucleotides are stop codons. Hence, in random or non-coding DNA one can expect roughly every 21st trinucleotide to have the same sequence as a stop codon. In contrast, the open reading frames (ORFs) of most protein-coding genes are considerably longer. Thus, the stop codon frequency in coding sequences deviates from the background frequency of the corresponding trinucleotides. This has been utilized for gene prediction, in particular for detecting protein-coding ORFs. Traditional methods that rely on stop codon frequency assume a GC content of about 50%. However, many genomes show significant deviations from that value. With the presented method we can describe the effects of GC content on the selection of appropriate length thresholds for potentially coding ORFs. Conversely, for a given length threshold, we can calculate the probability of observing an ORF of that length in a random sequence. Thus, we can derive the maximum GC content for which ORF length is practicable as a feature for gene prediction methods, together with the resulting false positive rates. A rough estimate for an upper limit is a GC content of 80%. This estimate can be made more precise by including further parameters and by taking start codons into account as well. We demonstrate the feasibility of this method by applying it to the genomes of the bacteria Rickettsia prowazekii, Escherichia coli and Caulobacter crescentus, exemplifying the effect of GC content variations according to our predictions. We have adapted the method for predicting coding ORFs by stop codon frequency to the case of GC contents different from 50%. Usually, several methods for gene finding need to be combined. Thus, our results concern a specific part within a package of methods. Interestingly, for genomes with low GC content such as that of R. prowazekii, the presented method provides remarkably good results even when applied alone.
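
The dependence of stop codon frequency on GC content described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes a simple random-sequence model in which nucleotides are independent, with P(G) = P(C) = GC/2 and P(A) = P(T) = (1 - GC)/2, and it ignores start codons. The function names and the parameter `alpha` (an accepted false positive rate per ORF) are hypothetical and chosen only for illustration.

```python
import math


def stop_codon_probability(gc: float) -> float:
    """Probability that a random trinucleotide is TAA, TAG or TGA.

    Assumes independent nucleotides with P(G) = P(C) = gc / 2 and
    P(A) = P(T) = (1 - gc) / 2 (an illustrative model, not the paper's).
    """
    if not 0.0 <= gc < 1.0:
        raise ValueError("gc must be in [0, 1)")
    at = (1.0 - gc) / 2.0  # probability of A, and of T
    g = gc / 2.0           # probability of G (and of C)
    # TAA contributes at^3; TAG and TGA each contribute at^2 * g.
    return at ** 3 + 2.0 * at * at * g


def prob_random_orf_at_least(n_codons: int, gc: float) -> float:
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1.0 - stop_codon_probability(gc)) ** n_codons


def min_length_threshold(gc: float, alpha: float) -> int:
    """Smallest codon count n such that a stop-free random run of length n
    occurs with probability at most alpha (hypothetical tolerance)."""
    p = stop_codon_probability(gc)
    return math.ceil(math.log(alpha) / math.log(1.0 - p))


if __name__ == "__main__":
    for gc in (0.3, 0.5, 0.7, 0.8):
        p = stop_codon_probability(gc)
        n = min_length_threshold(gc, alpha=0.01)
        print(f"GC={gc:.0%}  P(stop)={p:.4f}  "
              f"mean spacing={1 / p:.1f} codons  threshold={n} codons")
```

At a GC content of 50% the model reproduces the familiar value of 3/64. As GC content rises, stop codons become rarer in random sequence, so the length threshold needed to keep the same false positive rate grows, which is the effect the abstract describes.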

Similar resources

Analysis of Codon Usage Patterns in Toxic Dinoflagellate Alexandrium tamarense through Expressed Sequence Tag Data

We have analyzed synonymous codon usage in the genome of A. tamarense CCMP 1598 for protein-coding sequences from 10865 expressed sequence tags (ESTs). We reconstructed a total of 4284 unigenes, including 74 ribosomal protein and 40 plastid-related genes, from ESTs using FrameDP, an open reading frame (ORF) prediction program. Correspondence analysis of A. tamarense genes based on codon usage s...

Role of premature stop codons in bacterial evolution.

When the stop codons TGA, TAA, and TAG are found in the second and third reading frames of a protein-encoding gene, they are considered premature stop codons (PSC). Deinococcus radiodurans disproportionately favored TGA more than the other two triplets as a PSC. The TGA triplet was also found more often in noncoding regions and as a stop codon, though the bias was less pronounced. We investigat...

Genome-wide prediction of stop codon readthrough during translation in the yeast Saccharomyces cerevisiae.

In-frame stop codons normally signal termination during mRNA translation, but they can be read as 'sense' (readthrough) depending on their context, comprising the 6 nt preceding and following the stop codon. To identify novel contexts directing readthrough, under-represented 5' and 3' stop codon contexts from Saccharomyces cerevisiae were identified by genome-wide survey in silico. In contrast ...

Predicting Statistical Properties of Open Reading Frames in Bacterial Genomes

An analytical model based on the statistical properties of Open Reading Frames (ORFs) of eubacterial genomes such as codon composition and sequence length of all reading frames was developed. This new model predicts the average length, maximum length as well as the length distribution of the ORFs of 70 species with GC contents varying between 21% and 74%. Furthermore, the number of annotated ge...

Prokaryotic Genome Annotation Pipeline

The process of annotating prokaryotic genomes includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions, direct and inverted repeats, insertion sequences, transposons, and other mobile elements. Bacterial and archaeal genomes have the considerable advantage of usually lacking introns, which subs...

Journal:
  • Gene

Volume 511, Issue 2

Pages: -

Publication date: 2012